Background

This file is for analysis of some data downloaded from The COVID Tracking Project on 20 August, 2020. This file contains data on positive tests, hospitalizations, deaths, and the like for coronavirus cases in the US. Data are unique by state and date.

Data Availability

The downloaded data file is read in as CSV, and the date column is converted to date format:

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
cvData <- readr::read_csv("./RInputFiles/Coronavirus/CV_downloaded_200820.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   state = col_character(),
##   dataQualityGrade = col_character(),
##   lastUpdateEt = col_character(),
##   dateModified = col_datetime(format = ""),
##   checkTimeEt = col_character(),
##   dateChecked = col_datetime(format = ""),
##   hash = col_character(),
##   grade = col_logical()
## )
## See spec(...) for full column specifications.
glimpse(cvData)
## Observations: 9,449
## Variables: 53
## $ date                        <dbl> 20200820, 20200820, 20200820, 20200820,...
## $ state                       <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO...
## $ positive                    <dbl> 5332, 112449, 54765, 0, 196280, 644751,...
## $ negative                    <dbl> 307315, 784330, 593744, 1514, 922163, 9...
## $ pending                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ hospitalizedCurrently       <dbl> 51, 1105, 499, NA, 1070, 6212, 238, 47,...
## $ hospitalizedCumulative      <dbl> NA, 13330, 3790, NA, 21143, NA, 6784, 1...
## $ inIcuCurrently              <dbl> NA, NA, NA, NA, 388, 1707, NA, NA, 26, ...
## $ inIcuCumulative             <dbl> NA, 1348, NA, NA, NA, NA, NA, NA, NA, N...
## $ onVentilatorCurrently       <dbl> 6, NA, 108, NA, 233, NA, NA, NA, 12, NA...
## $ onVentilatorCumulative      <dbl> NA, 734, 488, NA, NA, NA, NA, NA, NA, N...
## $ recovered                   <dbl> 1513, 44684, 48458, NA, 28471, NA, 5759...
## $ dataQualityGrade            <chr> "A", "B", "A", "C", "A+", "B", "A", "B"...
## $ lastUpdateEt                <chr> "8/20/2020 0:00", "8/20/2020 11:00", "8...
## $ dateModified                <dttm> 2020-08-20 00:00:00, 2020-08-20 11:00:...
## $ checkTimeEt                 <chr> "8/19/2020 20:00", "8/20/2020 7:00", "8...
## $ death                       <dbl> 29, 1974, 641, 0, 4684, 11686, 1800, 44...
## $ hospitalized                <dbl> NA, 13330, 3790, NA, 21143, NA, 6784, 1...
## $ dateChecked                 <dttm> 2020-08-20 00:00:00, 2020-08-20 11:00:...
## $ totalTestsViral             <dbl> 312647, 891813, 648509, NA, 1116897, 10...
## $ positiveTestsViral          <dbl> 4970, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ negativeTestsViral          <dbl> 307360, NA, 593744, NA, NA, NA, NA, NA,...
## $ positiveCasesViral          <dbl> 5332, 107483, 54765, 0, 194734, 644751,...
## $ deathConfirmed              <dbl> 29, 1905, NA, NA, 4429, NA, NA, 3572, N...
## $ deathProbable               <dbl> NA, 69, NA, NA, 255, NA, NA, 886, NA, 6...
## $ totalTestEncountersViral    <dbl> NA, NA, NA, NA, NA, NA, 895207, NA, NA,...
## $ totalTestsPeopleViral       <dbl> NA, NA, NA, 1514, NA, NA, 645170, NA, 2...
## $ totalTestsAntibody          <dbl> NA, NA, NA, NA, 255456, NA, 150931, NA,...
## $ positiveTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 10406, NA, NA, ...
## $ negativeTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 140525, NA, NA,...
## $ totalTestsPeopleAntibody    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ positiveTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ negativeTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsPeopleAntigen     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ positiveTestsPeopleAntigen  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsAntigen           <dbl> NA, NA, 10358, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsAntigen        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ fips                        <dbl> 2, 1, 5, 60, 4, 6, 8, 9, 11, 10, 12, 13...
## $ positiveIncrease            <dbl> 85, 971, 549, 0, 723, 5920, 270, 118, 5...
## $ negativeIncrease            <dbl> 1713, 10462, 6680, 0, 6481, 81363, 4657...
## $ total                       <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ totalTestResults            <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ totalTestResultsIncrease    <dbl> 1798, 11433, 7229, 0, 7204, 87283, 7348...
## $ posNeg                      <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ deathIncrease               <dbl> 0, 30, 10, 0, 50, 163, 12, 1, 1, 0, 119...
## $ hospitalizedIncrease        <dbl> 0, 250, 47, 0, 123, 0, 3, 72, 0, 0, 450...
## $ hash                        <chr> "c83a1d575a597788adccbe170950b8d197754b...
## $ commercialScore             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
cvData <- cvData %>% 
    mutate(date=lubridate::ymd(date))
glimpse(cvData)
## Observations: 9,449
## Variables: 53
## $ date                        <date> 2020-08-20, 2020-08-20, 2020-08-20, 20...
## $ state                       <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO...
## $ positive                    <dbl> 5332, 112449, 54765, 0, 196280, 644751,...
## $ negative                    <dbl> 307315, 784330, 593744, 1514, 922163, 9...
## $ pending                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ hospitalizedCurrently       <dbl> 51, 1105, 499, NA, 1070, 6212, 238, 47,...
## $ hospitalizedCumulative      <dbl> NA, 13330, 3790, NA, 21143, NA, 6784, 1...
## $ inIcuCurrently              <dbl> NA, NA, NA, NA, 388, 1707, NA, NA, 26, ...
## $ inIcuCumulative             <dbl> NA, 1348, NA, NA, NA, NA, NA, NA, NA, N...
## $ onVentilatorCurrently       <dbl> 6, NA, 108, NA, 233, NA, NA, NA, 12, NA...
## $ onVentilatorCumulative      <dbl> NA, 734, 488, NA, NA, NA, NA, NA, NA, N...
## $ recovered                   <dbl> 1513, 44684, 48458, NA, 28471, NA, 5759...
## $ dataQualityGrade            <chr> "A", "B", "A", "C", "A+", "B", "A", "B"...
## $ lastUpdateEt                <chr> "8/20/2020 0:00", "8/20/2020 11:00", "8...
## $ dateModified                <dttm> 2020-08-20 00:00:00, 2020-08-20 11:00:...
## $ checkTimeEt                 <chr> "8/19/2020 20:00", "8/20/2020 7:00", "8...
## $ death                       <dbl> 29, 1974, 641, 0, 4684, 11686, 1800, 44...
## $ hospitalized                <dbl> NA, 13330, 3790, NA, 21143, NA, 6784, 1...
## $ dateChecked                 <dttm> 2020-08-20 00:00:00, 2020-08-20 11:00:...
## $ totalTestsViral             <dbl> 312647, 891813, 648509, NA, 1116897, 10...
## $ positiveTestsViral          <dbl> 4970, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ negativeTestsViral          <dbl> 307360, NA, 593744, NA, NA, NA, NA, NA,...
## $ positiveCasesViral          <dbl> 5332, 107483, 54765, 0, 194734, 644751,...
## $ deathConfirmed              <dbl> 29, 1905, NA, NA, 4429, NA, NA, 3572, N...
## $ deathProbable               <dbl> NA, 69, NA, NA, 255, NA, NA, 886, NA, 6...
## $ totalTestEncountersViral    <dbl> NA, NA, NA, NA, NA, NA, 895207, NA, NA,...
## $ totalTestsPeopleViral       <dbl> NA, NA, NA, 1514, NA, NA, 645170, NA, 2...
## $ totalTestsAntibody          <dbl> NA, NA, NA, NA, 255456, NA, 150931, NA,...
## $ positiveTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 10406, NA, NA, ...
## $ negativeTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 140525, NA, NA,...
## $ totalTestsPeopleAntibody    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ positiveTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ negativeTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsPeopleAntigen     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ positiveTestsPeopleAntigen  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsAntigen           <dbl> NA, NA, 10358, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsAntigen        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ fips                        <dbl> 2, 1, 5, 60, 4, 6, 8, 9, 11, 10, 12, 13...
## $ positiveIncrease            <dbl> 85, 971, 549, 0, 723, 5920, 270, 118, 5...
## $ negativeIncrease            <dbl> 1713, 10462, 6680, 0, 6481, 81363, 4657...
## $ total                       <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ totalTestResults            <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ totalTestResultsIncrease    <dbl> 1798, 11433, 7229, 0, 7204, 87283, 7348...
## $ posNeg                      <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ deathIncrease               <dbl> 0, 30, 10, 0, 50, 163, 12, 1, 1, 0, 119...
## $ hospitalizedIncrease        <dbl> 0, 250, 47, 0, 123, 0, 3, 72, 0, 0, 450...
## $ hash                        <chr> "c83a1d575a597788adccbe170950b8d197754b...
## $ commercialScore             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
cvData %>%
    select(date, state) %>%
    anyDuplicated()
## [1] 0

As expected, the file is unique by date and state. The date field has been converted from double to date. The main columns of interest will be:

  • date
  • state
  • positiveIncrease - number of new positive cases for state during date
  • deathIncrease - number of new deaths for state during date

A smaller frame containing only this data is created:

cvUse <- cvData %>%
    select(date, state, cases=positiveIncrease, deaths=deathIncrease)

cvUse %>%
    summarize_if(is.numeric, sum)
## # A tibble: 1 x 2
##     cases deaths
##     <dbl>  <dbl>
## 1 5546056 166127

The numeric totals match those reported by the COVID Tracking Project for the same date. They are roughly 5% lower than the totals reported by worldometers.info. There are significant issues associated with official reporting for corornavirus, and a 5% discrepancy between sources is not unexpected.

The data are next checked for totals by state and by week:

cvUse %>%
    group_by(state) %>%
    summarize_if(is.numeric, sum) %>%
    pivot_longer(-state) %>%
    ggplot(aes(x=fct_reorder(state, value), y=value)) + 
    geom_col(fill="lightblue") +
    coord_flip() + 
    facet_wrap(~name, scales="free_x") + 
    labs(x="", y="", title="Coronavirus cases and deaths by state through August 20, 2020")

cvUse %>%
    group_by(week=lubridate::epiweek(date)) %>%
    summarize_if(is.numeric, sum) %>%
    pivot_longer(-week) %>%
    ggplot(aes(x=week, y=value)) + 
    geom_line() +
    facet_wrap(~name, scales="free_y") + 
    labs(x="", y="", title="Coronavirus cases and deaths by epidemiological week through August 20, 2020")

Sort order by state generally matches published reports of coronavirus burden by state. The weekly data appear broadly aligned with other published data. The dip in the final week is due to only 5 of the 7 days of the week being included in the Thursday data file.

State population data (2015 estimates) are obtained from usmap for converting metrics to per capita, and the cvData file is filtered to only those observations contained in the state population file:

statePop <- usmap::statepop %>%
    select(state=abbr, name=full, pop_2015)
glimpse(statePop)
## Observations: 51
## Variables: 3
## $ state    <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL"...
## $ name     <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", ...
## $ pop_2015 <dbl> 4858979, 738432, 6828065, 2978204, 39144818, 5456574, 3590...
cvUse <- cvUse %>%
    semi_join(statePop)
## Joining, by = "state"
cvUse %>%
    summarize_if(is.numeric, sum)
## # A tibble: 1 x 2
##     cases deaths
##     <dbl>  <dbl>
## 1 5516295 165742

Over 99% of the cases and deaths are preserved when the territories and other non-state data are removed.

There appear to be at least two peaks in the case and death data, likely driven by different locales experiencing outbreaks at different times. Per capita cases and deaths by state are plotted:

# Per capita metrics by state
cvStatePerCapita <- cvUse %>%
    group_by(state) %>%
    summarize_if(is.numeric, sum) %>%
    inner_join(statePop) %>%
    mutate(cases=1000000*cases/pop_2015, deaths=1000000*deaths/pop_2015)
## Joining, by = "state"
# Disease burden by state, per capita
cvStatePerCapita %>%
    ggplot(aes(x=cases, y=deaths)) +
    geom_text(aes(label=state)) + 
    labs(x="Cases per million through August 20", y="Deaths per million through August 20")

States will be defined as having a low impact of disease if 1) deaths per million are 100 or less, and 2) cases per million are 10000 or less:

lowBurden <- cvStatePerCapita %>%
    filter(deaths <= 100, cases <= 10000) %>%
    pull(state)

Next, the states that are not defined as low burden are hierarchically clustered, using total deaths per capita by week. Due to very significant expansions in testing volume both by state and within state over time, death data is likely more representative of disease burden by time than cases data. Deaths per capita by state by month are capped at 300 since otherwise the distance between the extremely high states (which is not so meaningful here) dominates the differences in early vs. late disease bruden:

# Calculate the raw data
clustData <- cvUse %>%
    filter(!(state %in% lowBurden)) %>%
    inner_join(statePop) %>% 
    mutate(month=lubridate::month(date), cpm=1000000*cases/pop_2015, dpm=1000000*deaths/pop_2015) %>% 
    filter(date >= as.Date("2020-03-15")) %>% 
    group_by(state, month) %>% 
    summarize(dpm=sum(dpm), cpm=sum(cpm), n=n()) %>% 
    pivot_wider(state, names_from=month, values_from=c(dpm, cpm)) %>%
    ungroup()
## Joining, by = "state"
# Run clusters without normalization, but with dpm limited to 300
distData <- clustData %>%
    select(state, starts_with("dpm")) %>%
    mutate_if(is.numeric, .funs=~pmin(., 300)) %>%
    column_to_rownames("state")

cvTree <- hclust(dist(distData))

# Plot the dendrogram
plot(cvTree)

There appears to be a cluster of states that had early outbreaks, a cluster of states that had later outbreaks, and a large segment that falls in between these extremes. Suppose the dendrogram is split in to three clusters, with the low burden states added as a fourth cluster:

# Get the clusters from the tree, adding the low burden states as cluster 4
cvClusters <- c(cutree(cvTree, k=3), 
                rep(4, length(lowBurden)) %>% set_names(lowBurden)
                )

# Add the clusters to the population data file
statePop <- statePop %>%
    mutate(cluster=factor(cvClusters[state]))

# Show a map of the clusters
usmap::plot_usmap(regions="states", data=statePop, values="cluster")

# Show population totals by cluster
statePop %>%
    group_by(cluster) %>%
    summarize(pop_2015=sum(pop_2015)/1000000) %>%
    ggplot(aes(x=fct_reorder(cluster, pop_2015), y=pop_2015)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=pop_2015/2, label=round(pop_2015))) + 
    labs(y="2015 population (millions)", x="Cluster", title="Population by cluster (millions)") +
    coord_flip()

# Virus by week by cluster
cvUse %>%
    mutate(cluster=factor(cvClusters[state]), week=lubridate::epiweek(date)) %>%
    group_by(cluster, week) %>%
    summarize(cases=sum(cases), deaths=sum(deaths)) %>%
    pivot_longer(-c(week, cluster)) %>%
    ggplot(aes(x=week, y=value, group=cluster, color=cluster)) + 
    geom_line() + 
    facet_wrap(~name, scales="free_y")

Metrics can be normalized by population to look at coronavirus burden per capita by segment over time:

# Integrated data file
cvWeekPop <- cvUse %>%
    mutate(week=lubridate::epiweek(date)) %>%
    inner_join(statePop, by="state")

# Summarized by date-cluster
cvDateCluster <- cvWeekPop %>%
    group_by(date, cluster) %>%
    summarize(cases=sum(cases), deaths=sum(deaths)) %>%
    inner_join(statePop %>% group_by(cluster) %>% summarize(pop_mill=sum(pop_2015)/1000000), by="cluster") %>%
    group_by(cluster) %>%
    mutate(cpm7=zoo::rollmean(cases, 7, fill=NA)/pop_mill, 
           dpm7=zoo::rollmean(deaths, 7, fill=NA)/pop_mill
           ) %>%
    ungroup()

# Plotted by date
cvDateCluster %>%
    select(date, cluster, cases=cpm7, deaths=dpm7) %>%
    pivot_longer(-c(date, cluster)) %>%
    filter(!is.na(value)) %>%
    ggplot(aes(x=date, y=value, group=cluster, color=cluster)) +
    geom_line() + 
    facet_wrap(~name, scales="free_y") + 
    labs(x="", y="Rolling 7-day mean, per million", title="Rolling 7-day mean disease burden, per million")

Broadly speaking:

  • Segment 4, which is small states with low coronavirus burden, may be experiencing a small rise in both cases and deaths
  • Segment 3, which is much of the northeast plus MI, IL, and LA has a significant spike early followed by a rapid and largely sustained decline
  • Segment 1, which is largely the south ex-LA/NM, appears to have peaked in cases by the middle of July and to be starting to decline in deaths as of the middle of August
  • Segment 2, which is about half the country, appears similar to Segment 1 but with a much less pronounced spike in both cases and deaths in July/August

There appear to be meaningful differences in disease burden over time, and with a meaningful geographical explanatory component.

Next, the total volume of disease through August 20 is explored by state:

varMapper <- c("cases"="Cases through Aug 20", 
               "newCases"="Increase in cases, 30 days through Aug 20",
               "casesroll7"="Rolling 7-day mean cases, through Aug 20", 
               "deaths"="Deaths through Aug 20", 
               "newDeaths"="Increase in deaths, 30 days through Aug 20",
               "deathsroll7"="Rolling 7-day mean deaths, through Aug 20", 
               "cpm"="Cases through Aug 20 (per million)",
               "cpm7"="Cases per day (7-day rolling mean) through Aug 20 (per million)", 
               "newcpm"="Increase in cases, 30 days through Aug 20 (per million)",
               "dpm"="Deaths through Aug 20 (per million)", 
               "dpm7"="Deaths per day (7-day rolling mean) through Aug 20 (per million)", 
               "newdpm"="Increase in deaths, 30 days through Aug 20 (per million)"
               )

cvWeekPop %>%
    group_by(state, cluster) %>%
    summarize(cases=sum(cases), deaths=sum(deaths), pop_2015=mean(pop_2015)) %>%
    ungroup() %>%
    mutate(cpm=1000000*cases/pop_2015, dpm=1000000*deaths/pop_2015) %>%
    select(state, cluster, cases, deaths) %>%
    pivot_longer(c(cases, deaths)) %>%
    ggplot(aes(x=fct_reorder(state, value, .fun=min), y=value)) + 
    geom_col(aes(fill=cluster)) + 
    coord_flip() + 
    labs(x="", y="", title="Coronavirus impact by state through August 20, 2020") + 
    facet_wrap(~varMapper[name], scales="free_x")

cvWeekPop %>%
    group_by(state, cluster) %>%
    summarize(cases=sum(cases), deaths=sum(deaths), pop_2015=mean(pop_2015)) %>%
    ungroup() %>%
    mutate(cpm=1000000*cases/pop_2015, dpm=1000000*deaths/pop_2015) %>%
    select(state, cluster, cpm, dpm) %>%
    pivot_longer(c(cpm, dpm)) %>%
    ggplot(aes(x=fct_reorder(state, value, .fun=min), y=value)) + 
    geom_col(aes(fill=cluster)) + 
    coord_flip() + 
    labs(x="", y="", title="Coronavirus impact by state through August 20, 2020") + 
    facet_wrap(~varMapper[name], scales="free_x")

As expected, the segmentation approach has largely divided the states by total coronavirus burden. Mississippi and Arizona are in segment 1 due to the late nature of their outbreak.

Further, the data are explored for a combination of total disease burden and change over the past 30 days:

cvWeekPop %>%
    mutate(newCases=ifelse(as.Date("2020-08-21")-date <= 30, cases, 0), 
           newDeaths=ifelse(as.Date("2020-08-21")-date <= 30, deaths, 0)
           ) %>%
    group_by(state, cluster) %>%
    summarize(cases=sum(cases), 
              deaths=sum(deaths), 
              newCases=sum(newCases),
              newDeaths=sum(newDeaths),
              pop_2015=mean(pop_2015)
              ) %>%
    ungroup() %>%
    mutate(cpm=1000000*cases/pop_2015, 
           dpm=1000000*deaths/pop_2015,
           newcpm=1000000*newCases/pop_2015, 
           newdpm=1000000*newDeaths/pop_2015
           ) %>%
    ggplot(aes(x=cpm, y=newcpm)) + 
    geom_text(aes(color=cluster, label=state)) + 
    labs(x=varMapper["cpm"], 
         y=varMapper["newcpm"], 
         title="Coronavirus impact by state through August 20, 2020"
         ) + 
    geom_abline(lty=2, slope=c(0.5)) + 
    lims(x=c(0, NA), y=c(0, NA)) +
    annotate("text", x=18000, y=11000, label="50% of total cases\nin last 30 days", hjust=1) + 
    annotate("segment", x=18500, y=10500, xend=20000, yend=10000, arrow=arrow(), lty=2)

cvWeekPop %>%
    mutate(newCases=ifelse(as.Date("2020-08-21")-date <= 30, cases, 0), 
           newDeaths=ifelse(as.Date("2020-08-21")-date <= 30, deaths, 0)
           ) %>%
    group_by(state, cluster) %>%
    summarize(cases=sum(cases), 
              deaths=sum(deaths), 
              newCases=sum(newCases),
              newDeaths=sum(newDeaths),
              pop_2015=mean(pop_2015)
              ) %>%
    ungroup() %>%
    mutate(cpm=1000000*cases/pop_2015, 
           dpm=1000000*deaths/pop_2015,
           newcpm=1000000*newCases/pop_2015, 
           newdpm=1000000*newDeaths/pop_2015
           ) %>%
    ggplot(aes(x=dpm, y=newdpm)) + 
    geom_text(aes(color=cluster, label=state)) + 
    labs(x=varMapper["dpm"], 
         y=varMapper["newdpm"], 
         title="Coronavirus impact by state through August 20, 2020"
         ) + 
    geom_abline(lty=2, slope=c(0.5)) + 
    lims(x=c(0, NA), y=c(0, NA)) +
    annotate("text", x=250, y=200, label="50% of total deaths\nin last 30 days", hjust=1) + 
    annotate("segment", x=250, y=200, xend=400, yend=200, arrow=arrow(), lty=2)

The clusters appear relatively well separated, with the possible exception of Louisiana which is arguably quite close to cluster 1. Cluster 3 stands out as having had a very high overall impact, but with not much of an increase in the past 30 days.

The individual trends by state are also plotted, smoothed by week:

cvWeekPop %>%
    rbind(mutate(., state="cluster")) %>%
    group_by(state, cluster, date) %>%
    summarize_if(is.numeric, sum) %>%
    ungroup() %>%
    mutate(cpm=1000000*cases/pop_2015, dpm=1000000*deaths/pop_2015) %>%
    group_by(state, cluster) %>%
    mutate(cpm7=zoo::rollmean(cpm, k=7, fill=NA), dpm7=zoo::rollmean(dpm, k=7, fill=NA)) %>%
    ungroup() %>%
    filter(!is.na(cpm7)) %>%
    ggplot(aes(x=date, y=cpm7)) + 
    geom_line(data=~filter(., state != "cluster"), aes(group=state), alpha=0.25) + 
    geom_line(data=~filter(., state == "cluster"), aes(group=state, color=cluster), lwd=1.5) + 
    facet_wrap(~cluster, scales="free_y") + 
    labs(x="", 
         y=varMapper["cpm"], 
         title="Cases per million per day (rolling 7-day mean) by state and cluster", 
         subtitle="Caution that each facet has its own y axis with different scales"
         ) + 
    ylim(c(0, NA))

cvWeekPop %>%
    rbind(mutate(., state="cluster")) %>%
    group_by(state, cluster, date) %>%
    summarize_if(is.numeric, sum) %>%
    ungroup() %>%
    mutate(cpm=1000000*cases/pop_2015, dpm=1000000*deaths/pop_2015) %>%
    group_by(state, cluster) %>%
    mutate(cpm7=zoo::rollmean(cpm, k=7, fill=NA), dpm7=zoo::rollmean(dpm, k=7, fill=NA)) %>%
    ungroup() %>%
    filter(!is.na(dpm7)) %>%
    ggplot(aes(x=date, y=dpm7)) + 
    geom_line(data=~filter(., state != "cluster"), aes(group=state), alpha=0.25) + 
    geom_line(data=~filter(., state == "cluster"), aes(group=state, color=cluster), lwd=1.5) + 
    facet_wrap(~cluster, scales="free_y") + 
    labs(x="", 
         y=varMapper["dpm"], 
         title="Deaths per million per day (rolling 7-day mean) by state and cluster", 
         subtitle="Caution that each facet has its own y axis with different scales"
         ) + 
    ylim(c(0, NA))

With a few exceptions in a rather noisy segment 2 (as well as Louisiana in segment 3), states seem to broadly follow the disease state pattern for their cluster, though with some differences in magnitude and timing.

The process is converted to functional form so that it can be run using different data. First, a function is written to read in the data:

# Function to read in the raw coronavirus data file (assume it is already downloaded)
readCVData <- function(fileName,
                       showGlimpse=TRUE,
                       uqVars=c("date", "state"),
                       errDups=TRUE
                       ) {

    # FUNCTION ARGUMENTS
    # fileName: location of the downloded CSV file from COVID Tracking Project
    # showGlimpse: boolean, whether to run glimpse() on the file
    # uqVars: variables that the file is expected to be unique by
    # errDups: boolean, whether to error out if uniqueness is violated
    
    # Read in the file and convert the 'date' from double to date
    cvData <- readr::read_csv(fileName) %>%
        mutate(date=lubridate::ymd(date))
    
    # See a sample of the data
    if (showGlimpse) glimpse(cvData)

    # Check that the data are unique by date and state
    nDups <- cvData %>%
        select_at(vars(all_of(uqVars))) %>%
        anyDuplicated()
    
    # Inform of the uniqueness check results
    if (nDups==0) {
        cat("\nFile is unique by:", uqVars, "and has dimensions:", dim(cvData), "\n")
    } else {
        cat("\nUniqueness check failed, file has duplicates by:", uqVars, "\n")
        if (errDups) stop("Fix and re-run")
    }
    
    # Return the file
    cvData
    
}

cvFull <- readCVData("./RInputFiles/Coronavirus/CV_downloaded_200820.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   state = col_character(),
##   dataQualityGrade = col_character(),
##   lastUpdateEt = col_character(),
##   dateModified = col_datetime(format = ""),
##   checkTimeEt = col_character(),
##   dateChecked = col_datetime(format = ""),
##   hash = col_character(),
##   grade = col_logical()
## )
## See spec(...) for full column specifications.
## Observations: 9,449
## Variables: 53
## $ date                        <date> 2020-08-20, 2020-08-20, 2020-08-20, 20...
## $ state                       <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO...
## $ positive                    <dbl> 5332, 112449, 54765, 0, 196280, 644751,...
## $ negative                    <dbl> 307315, 784330, 593744, 1514, 922163, 9...
## $ pending                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ hospitalizedCurrently       <dbl> 51, 1105, 499, NA, 1070, 6212, 238, 47,...
## $ hospitalizedCumulative      <dbl> NA, 13330, 3790, NA, 21143, NA, 6784, 1...
## $ inIcuCurrently              <dbl> NA, NA, NA, NA, 388, 1707, NA, NA, 26, ...
## $ inIcuCumulative             <dbl> NA, 1348, NA, NA, NA, NA, NA, NA, NA, N...
## $ onVentilatorCurrently       <dbl> 6, NA, 108, NA, 233, NA, NA, NA, 12, NA...
## $ onVentilatorCumulative      <dbl> NA, 734, 488, NA, NA, NA, NA, NA, NA, N...
## $ recovered                   <dbl> 1513, 44684, 48458, NA, 28471, NA, 5759...
## $ dataQualityGrade            <chr> "A", "B", "A", "C", "A+", "B", "A", "B"...
## $ lastUpdateEt                <chr> "8/20/2020 0:00", "8/20/2020 11:00", "8...
## $ dateModified                <dttm> 2020-08-20 00:00:00, 2020-08-20 11:00:...
## $ checkTimeEt                 <chr> "8/19/2020 20:00", "8/20/2020 7:00", "8...
## $ death                       <dbl> 29, 1974, 641, 0, 4684, 11686, 1800, 44...
## $ hospitalized                <dbl> NA, 13330, 3790, NA, 21143, NA, 6784, 1...
## $ dateChecked                 <dttm> 2020-08-20 00:00:00, 2020-08-20 11:00:...
## $ totalTestsViral             <dbl> 312647, 891813, 648509, NA, 1116897, 10...
## $ positiveTestsViral          <dbl> 4970, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ negativeTestsViral          <dbl> 307360, NA, 593744, NA, NA, NA, NA, NA,...
## $ positiveCasesViral          <dbl> 5332, 107483, 54765, 0, 194734, 644751,...
## $ deathConfirmed              <dbl> 29, 1905, NA, NA, 4429, NA, NA, 3572, N...
## $ deathProbable               <dbl> NA, 69, NA, NA, 255, NA, NA, 886, NA, 6...
## $ totalTestEncountersViral    <dbl> NA, NA, NA, NA, NA, NA, 895207, NA, NA,...
## $ totalTestsPeopleViral       <dbl> NA, NA, NA, 1514, NA, NA, 645170, NA, 2...
## $ totalTestsAntibody          <dbl> NA, NA, NA, NA, 255456, NA, 150931, NA,...
## $ positiveTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 10406, NA, NA, ...
## $ negativeTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 140525, NA, NA,...
## $ totalTestsPeopleAntibody    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ positiveTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ negativeTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsPeopleAntigen     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ positiveTestsPeopleAntigen  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsAntigen           <dbl> NA, NA, 10358, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsAntigen        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ fips                        <dbl> 2, 1, 5, 60, 4, 6, 8, 9, 11, 10, 12, 13...
## $ positiveIncrease            <dbl> 85, 971, 549, 0, 723, 5920, 270, 118, 5...
## $ negativeIncrease            <dbl> 1713, 10462, 6680, 0, 6481, 81363, 4657...
## $ total                       <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ totalTestResults            <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ totalTestResultsIncrease    <dbl> 1798, 11433, 7229, 0, 7204, 87283, 7348...
## $ posNeg                      <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ deathIncrease               <dbl> 0, 30, 10, 0, 50, 163, 12, 1, 1, 0, 119...
## $ hospitalizedIncrease        <dbl> 0, 250, 47, 0, 123, 0, 3, 72, 0, 0, 450...
## $ hash                        <chr> "c83a1d575a597788adccbe170950b8d197754b...
## $ commercialScore             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## 
## File is unique by: date state and has dimensions: 9449 53

Next, a function selects only the key variables of interest, filters to include only states (plus DC), and reports on relevant control totals:

# Function to select relevant variables and observations, and report on control totals
processCVData <- function(dfFull, 
                          varsKeep=c("date", "state", "positiveIncrease", "deathIncrease"), 
                          varsRename=c("positiveIncrease"="cases", "deathIncrease"="deaths"), 
                          stateList=c(state.abb, "DC")
                          ) {
    
    # FUNCTION ARGUMENTS
    # dfFull: the full data file originally loaded
    # varsKeep: variables to keep from the full file
    # varsRename: variables to be renamed, using a named vector of form originalName=newName
    # stateList: variables for filtering state (NULL means do not run any filters)
    
    # Select only the key variables
    df <- dfFull %>%
        select_at(vars(all_of(varsKeep)))
    
    # Apply the renaming of variables
    names(df) <- ifelse(is.na(varsRename[names(df)]), names(df), varsRename[names(df)])

    # Designate each record as being either a valid state or not
    if (!is.null(stateList)) {
        df <- df %>%
            mutate(validState=state %in% stateList)
    } else {
        df <- df %>%
            mutate(validState=TRUE)
    }
    
    # Summarize the control totals for the data, based on whether the state is valid
    cat("\n\nControl totals - note that validState other than TRUE will be discarded\n\n")
    df %>%
        mutate(n=1) %>%
        group_by(validState) %>%
        summarize_if(is.numeric, sum) %>%
        print()
    
    # Return the file, filtered to where validState is TRUE, and deleting variable validState
    df %>%
        filter(validState) %>%
        select(-validState)

}

cvFiltered <- processCVData(cvFull)
## 
## 
## Control totals - note that validState other than TRUE will be discarded
## 
## # A tibble: 2 x 4
##   validState   cases deaths     n
##   <lgl>        <dbl>  <dbl> <dbl>
## 1 FALSE        29761    385   790
## 2 TRUE       5516295 165742  8659

Next, a state population data is processed for future use:

# Function to extract and format key state data
getStateData <- function(df=usmap::statepop, 
                         renameVars=c("abbr"="state", "full"="name", "pop_2015"="pop"), 
                         keepVars=c("state", "name", "pop")
                         ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing state data
    # renameVars: variables to be renamed, using named list with format "originalName"="newName"
    # keepVars: variables to be kept in the final file
    
    # Rename variables where appropriate
    names(df) <- ifelse(is.na(renameVars[names(df)]), names(df), renameVars[names(df)])
    
    # Return file with only key variables kept
    df %>%
        select_at(vars(all_of(keepVars)))
    
}

stateData <- getStateData()

Next, helper functions are written to convert a variable to per capita, or to convert a variable to a “rolling” mean:

# Helper function to create per capita metrics
helperPerCapita <- function(df, 
                            origVar, 
                            newName,
                            byVar="state",
                            popVar="pop",
                            popData=stateData,
                            mult=1000000
                            ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame currently being processed
    # origVar: the variables to be converted to per capita
    # newName: the new per capita variable name
    # byVar: the variable that will be merged by
    # popVar: the name of the population variable in the popData file
    # popData: the file containing the population data
    # mult: the multiplier, so that the metric is "per mult people"
    
    # Create the per capita variable
    df %>%
        inner_join(select_at(popData, vars(all_of(c(byVar, popVar)))), by=byVar) %>%
        mutate(!!newName:=mult*get(origVar)/get(popVar)) %>%
        select(-all_of(popVar))
    
}


# Helper function to create rolling aggregates
helperRollingAgg <- function(df, 
                             origVar, 
                             newName,
                             func=zoo::rollmean,
                             k=7, 
                             fill=NA, 
                             ...
                             ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the data
    # origVar: the original data column name
    # newName: the new variable column name
    # func: the function to be applied (zoo::rollmean will be by far the most common)
    # k: the periodicity (k=7 is rolling weekly data)
    # fill: how to fill leading.trailing data to maintain the same vector lengths
    # ...: any other arguments to be passed to func
    
    # Create the appropriate variable
    df %>%
        mutate(!!newName:=func(get(origVar), k=k, fill=fill, ...))
    
}



# Function to add per capita and rolling to the base data frame
helperMakePerCapita <- function(df, 
                                k=7
                                ) {
    
    # FUNCTION ARGUMENTS:
    # df: the initial data frame for conversion
    # k: the rolling time period to use
    
    # Create the variables for cpm, dpm, cpm7, and dpm7
    dfNew <- df %>%
        helperPerCapita(origVar="cases", newName="cpm") %>%
        helperPerCapita(origVar="deaths", newName="dpm") %>%
        group_by(state) %>%
        arrange(date) %>%
        helperRollingAgg(origVar="cpm", newName=paste0("cpm", k), k=k) %>%
        helperRollingAgg(origVar="dpm", newName=paste0("dpm", k), k=k) %>%
        ungroup()

    # Return the new data frame
    dfNew
    
}


# Create the variables for cpm, dpm, cpm7, and dpm7
cvFilteredPerCapita <- helperMakePerCapita(cvFiltered, k=7)
cvFilteredPerCapita
## # A tibble: 8,659 x 8
##    date       state cases deaths   cpm   dpm    cpm7  dpm7
##    <date>     <chr> <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>
##  1 2020-01-22 WA        0      0 0         0 NA         NA
##  2 2020-01-23 WA        0      0 0         0 NA         NA
##  3 2020-01-24 WA        0      0 0         0 NA         NA
##  4 2020-01-25 WA        0      0 0         0  0          0
##  5 2020-01-26 WA        0      0 0         0  0.0199     0
##  6 2020-01-27 WA        0      0 0         0  0.0199     0
##  7 2020-01-28 WA        0      0 0         0  0.0199     0
##  8 2020-01-29 WA        1      0 0.139     0  0.0398     0
##  9 2020-01-30 WA        0      0 0         0  0.0797     0
## 10 2020-01-31 WA        0      0 0         0  0.0996     0
## # ... with 8,649 more rows

Next, a function is written for creating side-by-side cases and death bar plots:

# Function to create side-by-side plots for a deaths and cases metric
# Data in df will be aggregated to be unique by byVar using aggFunc
helperBarDeathsCases <- function(df, 
                                 numVars,
                                 title="",
                                 xVar="state",
                                 fillVar=NULL,
                                 aggFunc=sum, 
                                 mapper=varMapper
                                 ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the data
    # numVars: the relevant numeric variables for plotting
    # title: plot title, default is nothing
    # xVar: the x-axis variable for plotting
    # fillVar: the variable that will color the bars in the final plot (NULL means use "lightblue" for all)
    # aggFunc: the aggregate function (will be applied to create data unique by byVar)
    # mapper: mapping file to convert x/y variables to descriptive axes (named vector "variable"="label")
    
    # OVERALL FUNCTION PROCESS:
    # 1.  Variables in numVar are aggregated by aggFunc to be unique by c(xVar, fillVar)
    # 2.  Data are pivoted longer
    # 3.  Bar charts are created, with coloring by fillVar if provided
    
    # Create the byVar for summing
    byVar <- xVar
    if (!is.null(fillVar)) { byVar <- c(byVar, fillVar) }
    
    # Process the data and create the plot
    p1 <- df %>%
        select_at(vars(all_of(c(byVar, numVars)))) %>%
        group_by_at(vars(all_of(byVar))) %>%
        summarize_all(aggFunc) %>%
        pivot_longer(-all_of(byVar)) %>%
        ggplot(aes(x=fct_reorder(get(xVar), value, .fun=min), y=value)) + 
        coord_flip() + 
        facet_wrap(~mapper[name], scales="free_x") + 
        labs(x="", y="", title=title) + 
        if (is.null(fillVar)) geom_col(fill="lightblue") else geom_col(aes_string(fill=fillVar))
    
    # Print the plot
    print(p1)
    
}

A function is written to assess the raw state-level totals:

# Function to assess state data (no segments created yet)
assessStateData <- function(df, 
                            titleStem="Coronavirus burden by state", 
                            cfrEst=0.005,
                            mapper=varMapper
                            ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the state-level data
    # titleStem: the main title description, with (total) or (per capita) appended
    # cfrEst: the estimated case fatality rate (CFR); a dashed abline will be plotted at this slope
    # mapper: mapping file to convert x/y variables to descriptive axes (named vector "variable"="label")
    
    # Plot cases and deaths by state, once for overall and once per capita
    helperBarDeathsCases(df, numVars=c("deaths", "cases"), title=paste0(titleStem, " (total)"))
    helperBarDeathsCases(df, numVars=c("dpm", "cpm"), title=paste0(titleStem, " (per capita)"))

    # Disease burden by state, per capita
    p1 <- df %>%
        group_by(state) %>%
        summarize(cpm=sum(cpm), dpm=sum(dpm)) %>%
        ggplot(aes(x=cpm, y=dpm)) +
        geom_text(aes(label=state)) + 
        labs(x=mapper["cpm"], 
             y=mapper["dpm"], 
             title="Deaths vs. cases by state (per million people)", 
             subtitle=paste0("Dashed line is a CFR of ", 
                             round(100*cfrEst, 1), 
                             "% (states far from this may have case counting issues)"
                             )
             ) + 
        geom_abline(slope=cfrEst, lty=2)
    print(p1)
    
    # Total disease burden nationally by day, not using functional form
    p2 <- df %>%
        select(date, cases, deaths) %>%
        group_by(date) %>%
        summarize_if(is.numeric, sum) %>%
        ungroup() %>%
        helperRollingAgg(origVar="cases", newName="casesroll7") %>%
        helperRollingAgg(origVar="deaths", newName="deathsroll7") %>%
        select(-cases, -deaths) %>%
        pivot_longer(-date) %>%
        filter(!is.na(value)) %>%
        ggplot(aes(x=date, y=value)) + 
        geom_line() +
        facet_wrap(~varMapper[name], scales="free_y") + 
        labs(x="", 
             y="",
             title=titleStem
             )
    print(p2)
    
}


# State-level assessments
assessStateData(cvFilteredPerCapita)

Next, functions for creating and assessing clusters are created. The approach can use either hierarchical clustering or k-means, and focus on the following variables:

  • Aggregate cases
  • Aggregate deaths
  • Shape of case curve (percentage by week/month)
  • Shape of death curve (percentage by week/month)

Cases is a tricky clustering variable since detection rates were in the single-digit percentages early in the outbreak (estimates of ~50x as many infected as diagnosed). As testing volumes increased, it is likely that a greater percentage of cases are diagnosed. States that have later outbreaks appear to have many more cases per capita but with a lower death rate per capita:

# Function to create an elbow plot for various numbers of clusters in the data
helperElbow <- function(mtx, 
                        testCenters, 
                        iter.max, 
                        nstart, 
                        silhouette=FALSE
                        ) {
    
    # FUNCTION ARGUMENTS:
    # mtx: a numeric matrix, or an object that can be coerced to a numeric matrix (no character fields)
    # testCenters: integer vector for the centers to be tested
    # iter.max: parameter passed to kmeans
    # nstart: parameter passed to kmeans
    # silhouette: whether to calculate the silhouette score
    
    # Create an object for storing tot.withinss and silhouetteScore
    totWithin <- vector("numeric", length(testCenters))
    silhouetteScore <- vector("numeric", length(testCenters))
    
    # Create the distancing data (required for silhouette score)
    if (silhouette) distData <- dist(mtx)
    
    # Run k-means for every value in testCenters, and store $tot.withinss (and silhouetteScore, if requested)
    n <- 1
    for (k in testCenters) {
        km <- kmeans(mtx, centers=k, iter.max=iter.max, nstart=nstart)
        totWithin[n] <- km$tot.withinss
        if (silhouette & (k > 1)) silhouetteScore[n] <- mean(cluster::silhouette(km$cluster, distData)[, 3])
        n <- n + 1
    }
    
    # Create the elbow plot
    p1 <- tibble::tibble(n=testCenters, wss=totWithin) %>%
        ggplot(aes(x=n, y=wss)) + 
        geom_point() + 
        geom_line() + 
        geom_text(aes(y=wss + 0.05*max(totWithin), x=n+0.2, label=round(wss, 1))) + 
        labs(x="Number of segments", y="Total Within Sum-Squares", title="Elbow plot") + 
        ylim(c(0, NA))
    
    # Create the silhouette plot if requested
    if (silhouette) {
        p2 <- tibble::tibble(n=testCenters, ss=silhouetteScore) %>%
            ggplot(aes(x=n, y=ss)) + 
            geom_point() + 
            geom_line() + 
            geom_text(aes(y=ss + 0.05*max(silhouetteScore), x=n+0.2, label=round(ss, 1))) + 
            labs(x="Number of segments", y="Mean silhouette width", title="Silhouette plot") + 
            ylim(c(-1, NA))
        gridExtra::grid.arrange(p1, p2, nrow=1)
    } else {
        print(p1)
    }
    
}


# Function to create clusters for the state data (requires all data from same year, as currently true)
clusterStates <- function(df, 
                          caseVar="cpm", 
                          deathVar="dpm",
                          shapeFunc=lubridate::month, 
                          minShape=NULL, 
                          minDeath=0,
                          minCase=0,
                          ratioTotalvsShape=1, 
                          ratioDeathvsCase=1, 
                          hierarchical=TRUE, 
                          hierMethod="complete", 
                          nCenters=3, 
                          iter.max=10,
                          nstart=1,
                          testCenters=NULL,
                          returnList=FALSE, 
                          seed=NULL
                          ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing cases and deaths data
    # caseVar: the variable containing the cases per capita data
    # deathVar: the variable containing the deaths per capita data
    # shapeFunc: the function to be used for creating the shape of the curve
    # minShape: the minimum value to be used for shape (to avoid very small amounts of data in Jan/Feb)
    #           NULL means keep everything
    # minDeath: use this value as a floor for the death metric when calculating shape
    # minCase: use this metric as a floor for the case metric when calculating shape
    # ratioTotalvsShape: amount of standard deviation to be kept in total variable vs shape variables
    # ratioDeathvsCase: amount of standard deviation to be kept in deaths vs cases 
    #                   (total death data will be scaled to have sd this many times higher than cases)
    #                   (death percentages by time period will be scaled directly by this amount)
    # hierarchical: boolean, if TRUE run hierarchical clustering, otherwise run k-means clustering
    #               only hierarchical clustering is currently implemented
    # hierMethod: the method for hierarchical clustering (e.g., 'complete' or 'single')
    # nCenters: the number of centers to use for kmeans clustering
    # testCenters: integer vector of centers to test (will create an elbow plot); NULL means do not test
    # iter.max: maximumum number of kmeans iterations (default in kmeans algorithm is 10)
    # nstart: number of random sets chosen for kmeans (default in kmeans algorithm is 1)
    # returnList: boolean, if FALSE just the cluster object is returned
    #                      if TRUE, a list is returned with dfCluster and the cluster object
    # seed: set the seed to this value (NULL means no seed)
    
    # Extract key information (aggregates and by shapeFunc for each state)
    df <- df %>%
        select_at(vars(all_of(c("date", "state", caseVar, deathVar)))) %>%
        purrr::set_names(c("date", "state", "cases", "deaths")) %>%
        mutate(timeBucket=shapeFunc(date)) %>%
        group_by(state, timeBucket) %>%
        summarize(cases=sum(cases), deaths=sum(deaths)) %>%
        ungroup()
    
    # Limit to only relevant time buckets if requested
    if (!is.null(minShape)) {
        df <- df %>%
            filter(timeBucket >= minShape)
    }
    
    # Extract an aggregate by state, scaled so that they have the proper ratio
    dfAgg <- df %>%
        group_by(state) %>%
        summarize(totalCases=sum(cases), totalDeaths=sum(deaths)) %>%
        ungroup() %>%
        mutate(totalDeaths=ratioDeathvsCase*totalDeaths*sd(totalCases)/sd(totalDeaths))
    
    # Extract the percentages (shapes) by month, scaled so that they have the proper ratio
    dfShape <- df %>%
        pivot_longer(-c(state, timeBucket)) %>%
        group_by(state, name) %>%
        mutate(tot=pmax(sum(value), ifelse(name=="deaths", minDeath, minCase)), 
               value=ifelse(name=="deaths", ratioDeathvsCase, 1) * value / tot) %>%
        select(-tot) %>%
        pivot_wider(state, names_from=c(name, timeBucket), values_from=value) %>%
        ungroup()
    
    # Function to calculate SD of a subset of columns
    calcSumSD <- function(df) {
        df %>% ungroup() %>% select(-state) %>% summarize_all(.funs=sd) %>% as.vector() %>% sum()
    }
    # Down-weight the aggregate data so that there is the proper sum of sd in aggregates and shapes
    aggSD <- calcSumSD(dfAgg)
    shapeSD <- calcSumSD(dfShape)
    dfAgg <- dfAgg %>%
        mutate_if(is.numeric, ~. * ratioTotalvsShape * shapeSD / aggSD)

    # Combine so there is one row per state
    dfCluster <- dfAgg %>%
        inner_join(dfShape, by="state")
    
    # Create hierarchical segments or kmeans segments
    keyData <- dfCluster %>% column_to_rownames("state")
    if (hierarchical) {
        objCluster <-  hclust(dist(keyData), method=hierMethod)
        plot(objCluster)
    } else {
        # Create an elbow plot if testCenters is not NULL
        if (!is.null(testCenters)) {
            helperElbow(keyData, testCenters=testCenters, iter.max=iter.max, nstart=nstart, silhouette=TRUE)
        }
        # Create the kmeans cluster object, setting a seed if requested
        if (!is.null(seed)) set.seed(seed)
        objCluster <- kmeans(keyData, centers=nCenters, iter.max=iter.max, nstart=nstart)
        cat("\nCluster means and counts\n")
        n=objCluster$size %>% cbind(objCluster$centers) %>% round(2) %>% t() %>% print()
    }

    # Return the data and object is a list if returnList is TRUE, otherwise return only the clustering object
    if (!isTRUE(returnList)) {
        objCluster
    } else {
        list(objCluster=objCluster, dfCluster=dfCluster)
    }
    
}

# Test clusters that weight deaths heavily vs. cases and that weight shape more highly than total
testCluster <- clusterStates(cvFilteredPerCapita, 
                             minShape=3, 
                             ratioDeathvsCase = 5, 
                             ratioTotalvsShape = 0.5, 
                             minDeath=100, 
                             minCase=10000
                             )

The clusters can then be assessed against several criteria:

# Helper function to assess 30-day change vs. total
helperRecentvsTotal <- function(df, 
                                xVar, 
                                yVar,
                                title,
                                recencyDays=30, 
                                ablineSlope=0.5, 
                                mapper=varMapper, 
                                printPlot=TRUE
                                ) {
    
    # FUNCTION ARGUMENTS:
    # df: the tibble containing data by state by day
    # xVar: the x-variable
    # yVar: the y-variable
    # title: the plot title
    # recencyDays: number of days to consider as recent
    # ablineSlope: dashed line will be drawn with this slope and intercept 0
    # mapper: mapping file to convert x/y variables to descriptive axes (named vector "variable"="label")
    # printPlot: boolean, whether to display the plot (if FALSE, plot object is returned)
    
    # Get the most date cutoff
    dateCutoff <- df %>% pull(date) %>% max() - recencyDays + 1
    cat("\nRecency is defined as", format(dateCutoff, "%Y-%m-%d"), "through current\n")
    
    # Create the plot
    p1 <- df %>%
        mutate(newCases=ifelse(date >= dateCutoff, cases, 0), 
               newDeaths=ifelse(date >= dateCutoff, deaths, 0), 
               newcpm=ifelse(date >= dateCutoff, cpm, 0), 
               newdpm=ifelse(date >= dateCutoff, dpm, 0)
               ) %>%
        group_by(state, cluster) %>%
        summarize_if(is.numeric, .funs=sum) %>%
        ungroup() %>%
        ggplot(aes_string(x=xVar, y=yVar)) + 
        geom_text(aes(color=cluster, label=state)) + 
        labs(x=mapper[xVar], 
             y=mapper[yVar], 
             title=title, 
             subtitle=paste0("Dashed line represents ", 
                             round(100*ablineSlope), 
                             "% of total is new in last ", 
                             recencyDays,
                             " days"
                             )
             ) + 
        geom_abline(lty=2, slope=ablineSlope) + 
        lims(x=c(0, NA), y=c(0, NA)) + 
        theme(legend.position = "bottom")
    
    if (isTRUE(printPlot)) {
        print(p1)
    } else {
        p1
    }

}


# Function to plot cluster vs. individual elements on a key metric
helperTotalvsElements <- function(df, 
                                  keyVar, 
                                  title,
                                  aggFunc=median, 
                                  mapper=varMapper, 
                                  facetScales="free_y", 
                                  printPlot=TRUE
                                  ) {

    # FUNCTION ARGUMENTS:
    # df: the data frame containing n-day rolling averages
    # keyVar: the variable to be plotted
    # title: the plot title
    # aggFunc: how to aggregate the elements to the segment
    #          CAUTION that this is an aggregate of averages, rather than a population-weighted aggregate
    # mapper: the variable mapping file to get the appropriate label for keyVar
    # facetScales: the scaling for the facets - "free_y" to let them all float, "fixed" to have them the same
    # printPlot: boolean, if TRUE print the plot (otherwise return the plot object)

    # Create an appropriate subtitle
    subtitle <- if(facetScales=="free_y") {
        "Caution that each facet has its own y axis with different scales"
    } else if (facetScales=="fixed") { 
        "All facets are on the same scale"
    } else {
        "Update subtitle code in function helperTotalvsElements"
    }
    
    # Create the plots for segment-level data
    p1 <- df %>%
        rbind(mutate(., state="cluster")) %>%
        group_by(state, cluster, date) %>%
        summarize_at(vars(all_of(keyVar)), .funs=aggFunc) %>%
        ungroup() %>%
        filter(!is.na(get(keyVar))) %>%
        ggplot(aes_string(x="date", y=keyVar)) + 
        geom_line(data=~filter(., state != "cluster"), aes(group=state), alpha=0.25) + 
        geom_line(data=~filter(., state == "cluster"), aes(group=state, color=cluster), lwd=1.5) + 
        facet_wrap(~cluster, scales=facetScales) + 
        labs(x="", 
             y=mapper[keyVar], 
             title=title, 
             subtitle=subtitle,
             caption="Cluster-level aggregates weight each state equally\n(NOT population-weighted)"
             ) + 
        ylim(c(0, NA)) + 
        theme(legend.position="bottom")
    
    # Print plot if requested, otherwise return it
    if (isTRUE(printPlot)) {
        print(p1)
    } else {
        p1
    }
    
}



# Function to assess clusters
assessClusters <- function(clusters, 
                           dfState=stateData, 
                           dfBurden=cvFilteredPerCapita,
                           plotsTogether=FALSE, 
                           clusterPlotsTogether=plotsTogether,
                           recentTotalTogether=plotsTogether, 
                           clusterAggTogether=plotsTogether
                           ) {
    
    # FUNCTION ARGUMENTS:
    # clusters: the named vector containing the clusters by state
    # dfState: the file containing the states and populations
    # plotsTogether: boolean, should plots be consolidated on fewer pages?
    # clusterPlotsTogether: boolean, should plots p1-p4 be consolidated?
    
    # Attach the clusters to the state population data
    dfState <- as.data.frame(clusters) %>%
        set_names("cluster") %>%
        rownames_to_column("state") %>%
        inner_join(dfState, by="state") %>%
        mutate(cluster=factor(cluster))
    
    # Plot the segments on a state map
    p1 <- usmap::plot_usmap(regions="states", data=dfState, values="cluster") + 
        scale_fill_discrete("cluster") + 
        theme(legend.position="right")
    
    # Plot the population totals by segment
    # Show population totals by cluster
    p2 <- dfState %>%
        group_by(cluster) %>%
        summarize(pop=sum(pop)/1000000) %>%
        ggplot(aes(x=fct_rev(cluster), y=pop)) + 
        geom_col(aes(fill=cluster)) + 
        geom_text(aes(y=pop/2, label=round(pop))) + 
        labs(y="2015 population (millions)", x="Cluster", title="Population by cluster (millions)") +
        coord_flip()
    
    # Plot the rolling 7-day mean dialy disease burden by cluster
    dfPlot <- dfState %>%
        inner_join(dfBurden, by="state") %>%
        tibble::as_tibble()
    
    # Plot the rolling 7-day mean dialy disease burden by cluster
    p3 <- dfPlot %>%        
        select(date, cluster, cases=cpm7, deaths=dpm7) %>%
        pivot_longer(-c(date, cluster)) %>%
        filter(!is.na(value)) %>%
        group_by(date, cluster, name) %>%
        summarize(value=median(value)) %>%
        ggplot(aes(x=date, y=value)) +
        geom_line(aes(group=cluster, color=cluster)) +
        facet_wrap(~name, scales="free_y") +
        labs(x="",
             y="Rolling 7-day mean, per million",
             title="Rolling 7-day mean daily disease burden, per million",
             subtitle="Median per day for states assigned to cluster"
             )
    
    # Plot the total cases and total deaths by cluster
    p4 <- dfPlot %>%
        group_by(cluster) %>%
        summarize(cases=sum(cases), deaths=sum(deaths)) %>%
        pivot_longer(-cluster) %>%
        ggplot(aes(x=fct_rev(cluster), y=value/1000)) + 
        geom_col(aes(fill=cluster)) + 
        geom_text(aes(y=value/2000, label=round(value/1000))) +
        coord_flip() + 
        facet_wrap(~varMapper[name], scales="free_x") + 
        labs(x="Cluster", y="Burden (000s)", title="Total cases and deaths by segment")

    # Place the plots together if plotsTogether is TRUE, otherwise just print
    if (isTRUE(plotsTogether)) {
        gridExtra::grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
    } else {
        print(p1); print(p2); print(p3); print(p4)
    }
    
    # Plot total cases and total deaths by state, colored by cluster
    helperBarDeathsCases(dfPlot, 
                         numVars=c("cases", "deaths"), 
                         title="Coronavirus impact by state through August 20, 2020", 
                         xVar=c("state"), 
                         fillVar=c("cluster")
                         )
    
    # Plot cases per million and deaths per million by state, colored by cluster
    helperBarDeathsCases(dfPlot, 
                         numVars=c("cpm", "dpm"), 
                         title="Coronavirus impact by state through August 20, 2020", 
                         xVar=c("state"), 
                         fillVar=c("cluster")
                         )
    
    # Plot last-30 vs total for cases per million by state, colored by cluster
    p7 <- helperRecentvsTotal(dfPlot, 
                              xVar="cpm", 
                              yVar="newcpm", 
                              title="Coronavirus burden through August 20", 
                              printPlot=FALSE
                              )
    
    # Plot last-30 vs total for deaths per million by state, colored by cluster
    p8 <- helperRecentvsTotal(dfPlot, 
                              xVar="dpm", 
                              yVar="newdpm", 
                              title="Coronavirus burden through August 20", 
                              printPlot=FALSE
                              )
    
    # Print the plots either as a single page or separately
    if (isTRUE(recentTotalTogether)) {
        gridExtra::grid.arrange(p7, p8, nrow=1)
    } else {
        print(p7); print(p8)
    }
    
    # Plot the cases per million on a free y-scale and a fixed y-scale
    p9 <- helperTotalvsElements(dfPlot, 
                                keyVar="cpm7", 
                                title="Cases per million, 7-day rolling mean", 
                                printPlot=FALSE
                                )
    p10 <- helperTotalvsElements(dfPlot, 
                                 keyVar="cpm7", 
                                 title="Cases per million, 7-day rolling mean", 
                                 facetScales="fixed", 
                                 printPlot=FALSE
                                 )
    
    # Plot the deaths per million on a free y-scale and a fixed y-scale
    p11 <- helperTotalvsElements(dfPlot, 
                                 keyVar="dpm7", 
                                 title="Deaths per million, 7-day rolling mean", 
                                 printPlot=FALSE
                                 )
    p12 <- helperTotalvsElements(dfPlot, 
                                 keyVar="dpm7", 
                                 title="Deaths per million, 7-day rolling mean", 
                                 facetScales="fixed", 
                                 printPlot=FALSE
                                 )
    
    if (isTRUE(clusterAggTogether)) {
        gridExtra::grid.arrange(p9, p11, nrow=1)
        gridExtra::grid.arrange(p10, p12, nrow=1)
    } else {
        print(p9); print(p10); print(p11); print(p12)
    }
    # Return the plotting data frame
    dfPlot

}

# Check how 5 clusters look, with Vermont arbitrarily reassigned as the same cluster as New Hampshire
clustVec <- cutree(testCluster, k=6)
clustVec["VT"] <- clustVec["NH"]
plotData <- assessClusters(clustVec)

## 
## Recency is defined as 2020-07-22 through current
## 
## Recency is defined as 2020-07-22 through current

At a glance, the segments appear reasonable:

  • Segments 4 and 5 had early disease, but with less pronounced spikes in segment 5 than in segment 4, and with a greater case rebound in segment 5 than in segment 4
  • Segments 2 had late disease, with little spike early
  • Segments 1 and 3 have generally had fewer deaths, though with spiky cases that are similarly shaped if much lower than segment 2 (late disease, which may be increased testing finding a higher percentage of disease)

The full process can then all be run in one place:

# Extract state data
stateData <- getStateData()

# Load and process CV data
cvFull <- readCVData("./RInputFiles/Coronavirus/CV_downloaded_200820.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   state = col_character(),
##   dataQualityGrade = col_character(),
##   lastUpdateEt = col_character(),
##   dateModified = col_datetime(format = ""),
##   checkTimeEt = col_character(),
##   dateChecked = col_datetime(format = ""),
##   hash = col_character(),
##   grade = col_logical()
## )
## See spec(...) for full column specifications.
## Observations: 9,449
## Variables: 53
## $ date                        <date> 2020-08-20, 2020-08-20, 2020-08-20, 20...
## $ state                       <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO...
## $ positive                    <dbl> 5332, 112449, 54765, 0, 196280, 644751,...
## $ negative                    <dbl> 307315, 784330, 593744, 1514, 922163, 9...
## $ pending                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ hospitalizedCurrently       <dbl> 51, 1105, 499, NA, 1070, 6212, 238, 47,...
## $ hospitalizedCumulative      <dbl> NA, 13330, 3790, NA, 21143, NA, 6784, 1...
## $ inIcuCurrently              <dbl> NA, NA, NA, NA, 388, 1707, NA, NA, 26, ...
## $ inIcuCumulative             <dbl> NA, 1348, NA, NA, NA, NA, NA, NA, NA, N...
## $ onVentilatorCurrently       <dbl> 6, NA, 108, NA, 233, NA, NA, NA, 12, NA...
## $ onVentilatorCumulative      <dbl> NA, 734, 488, NA, NA, NA, NA, NA, NA, N...
## $ recovered                   <dbl> 1513, 44684, 48458, NA, 28471, NA, 5759...
## $ dataQualityGrade            <chr> "A", "B", "A", "C", "A+", "B", "A", "B"...
## $ lastUpdateEt                <chr> "8/20/2020 0:00", "8/20/2020 11:00", "8...
## $ dateModified                <dttm> 2020-08-20 00:00:00, 2020-08-20 11:00:...
## $ checkTimeEt                 <chr> "8/19/2020 20:00", "8/20/2020 7:00", "8...
## $ death                       <dbl> 29, 1974, 641, 0, 4684, 11686, 1800, 44...
## $ hospitalized                <dbl> NA, 13330, 3790, NA, 21143, NA, 6784, 1...
## $ dateChecked                 <dttm> 2020-08-20 00:00:00, 2020-08-20 11:00:...
## $ totalTestsViral             <dbl> 312647, 891813, 648509, NA, 1116897, 10...
## $ positiveTestsViral          <dbl> 4970, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ negativeTestsViral          <dbl> 307360, NA, 593744, NA, NA, NA, NA, NA,...
## $ positiveCasesViral          <dbl> 5332, 107483, 54765, 0, 194734, 644751,...
## $ deathConfirmed              <dbl> 29, 1905, NA, NA, 4429, NA, NA, 3572, N...
## $ deathProbable               <dbl> NA, 69, NA, NA, 255, NA, NA, 886, NA, 6...
## $ totalTestEncountersViral    <dbl> NA, NA, NA, NA, NA, NA, 895207, NA, NA,...
## $ totalTestsPeopleViral       <dbl> NA, NA, NA, 1514, NA, NA, 645170, NA, 2...
## $ totalTestsAntibody          <dbl> NA, NA, NA, NA, 255456, NA, 150931, NA,...
## $ positiveTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 10406, NA, NA, ...
## $ negativeTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 140525, NA, NA,...
## $ totalTestsPeopleAntibody    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ positiveTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ negativeTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsPeopleAntigen     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ positiveTestsPeopleAntigen  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsAntigen           <dbl> NA, NA, 10358, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsAntigen        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ fips                        <dbl> 2, 1, 5, 60, 4, 6, 8, 9, 11, 10, 12, 13...
## $ positiveIncrease            <dbl> 85, 971, 549, 0, 723, 5920, 270, 118, 5...
## $ negativeIncrease            <dbl> 1713, 10462, 6680, 0, 6481, 81363, 4657...
## $ total                       <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ totalTestResults            <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ totalTestResultsIncrease    <dbl> 1798, 11433, 7229, 0, 7204, 87283, 7348...
## $ posNeg                      <dbl> 312647, 896779, 648509, 1514, 1118443, ...
## $ deathIncrease               <dbl> 0, 30, 10, 0, 50, 163, 12, 1, 1, 0, 119...
## $ hospitalizedIncrease        <dbl> 0, 250, 47, 0, 123, 0, 3, 72, 0, 0, 450...
## $ hash                        <chr> "c83a1d575a597788adccbe170950b8d197754b...
## $ commercialScore             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## 
## File is unique by: date state and has dimensions: 9449 53
cvFiltered <- processCVData(cvFull)
## 
## 
## Control totals - note that validState other than TRUE will be discarded
## 
## # A tibble: 2 x 4
##   validState   cases deaths     n
##   <lgl>        <dbl>  <dbl> <dbl>
## 1 FALSE        29761    385   790
## 2 TRUE       5516295 165742  8659
cvFilteredPerCapita <- helperMakePerCapita(cvFiltered, k=7)

# Run state-level assessments
assessStateData(cvFilteredPerCapita)

# Test clusters that weight deaths heavily vs. cases and that weight shape more highly than total
testCluster <- clusterStates(cvFilteredPerCapita, 
                             minShape=3, 
                             ratioDeathvsCase = 5, 
                             ratioTotalvsShape = 0.5, 
                             minDeath=100, 
                             minCase=10000
                             )

# Check how 6 clusters look, with Vermont arbitrarily reassigned as the same cluster as New Hampshire
clustVec2 <- cutree(testCluster, k=7)
clustVec2["VT"] <- clustVec2["NH"]

# Create the cluster assessments
plotData2 <- assessClusters(clustVec2)

## 
## Recency is defined as 2020-07-22 through current
## 
## Recency is defined as 2020-07-22 through current

The process is easy to run and update now that it is in functional form. An exploration is made for 6 segments, which allows for a bucket of LA, DC, RI. These are states that had meaningful early disease spikes (though less than the main high-spike-early cluster) and also meaningful late disease spikes (though less than the main high-spike-late cluster). Findings include:

  • There are significant differences (more than 10:1) in coronavirus deaths per capita by US state through August 20
  • There is strong evidence that cases are significantly undercounted in aggregate, since almost all states have a calculated CFR that is well above the current best estimate of 0.5%
  • The observed double-peak (“second wave”) in the US national data appears to be driven by different geographies (segments) having their first wave at different times; there has been a significant increase in testing, so deaths vs. cases in the late first waves are much less than they were in the early first waves
  • While the segments are data-derived, there are significant qualitative similarities among many of the states assigned to a given cluster (e.g., south, Acela corrdior, central/upper plains, etc.)
  • With the exception of segment 5, all segments have between 40 million and 80 million people
  • Segment 2 (south plus AZ/NV) has a meaningful spike in cases and deaths in July/August; the spike in cases appears to be descending rapidly, with deaths plateaued if not decreasing
  • Segment 4 (NY, NJ, CT, MA) has a meaningful spike in cases and deaths in March-May; the spikes appear to have descended to, and remained near, zero thereafter
  • Segment 6 (IL, IN, MI, PA, MD, DE) had a meaningful spike in cases and deaths in March-May, though at under half the per capita levels of Segment 4; there has been a second spike of cases in July/August, though with little increase in deaths, suggestive that there may be increased testing more so than increased disease
  • Segment 5 (LA, DC, RI) is something of a blend of Segment 4 and Segment 6; it has a higher spike early than Segment 6, but has a meaningful spike in cases late unlike Segment 4
  • Segments 1 and 3 both have relatively low coronavirus burden, with segment 3 having greater cases than Segment 1 early, while segment 1 has greater cases than segment 3 late (Segment 1 is roughly a low-impact version of Segment 2, while Segment 3 is roughly a low-impact version of Segment 6)

The clusterStates() function is updated in two ways:

  1. Designed to output, at user’s request, a list containing the processed state-level data as well as the clustering object
  2. Designed to implement kmeans segmentation in addition to hierarchical clustering

An example is run using k-means, with 2 segments (the most obvious best silhouette width given these parameters):

# Test clusters that weight deaths heavily vs. cases and that weight shape more highly than total
# Using kmeans and testing for 1-10 clusters
testCluster_km2 <- clusterStates(cvFilteredPerCapita, 
                                 minShape=3, 
                                 ratioDeathvsCase = 5, 
                                 ratioTotalvsShape = 0.5, 
                                 minDeath=100, 
                                 minCase=10000, 
                                 hierarchical=FALSE, 
                                 nCenters=2,
                                 testCenters=1:10, 
                                 iter.max=20,
                                 nstart=10, 
                                 returnList=TRUE, 
                                 seed=2008261350
                                 )

## 
## Cluster means and counts
##                 1     2
## .           12.00 39.00
## totalCases   0.75  0.57
## totalDeaths  3.67  1.02
## cases_3      0.06  0.02
## deaths_3     0.14  0.12
## cases_4      0.34  0.09
## deaths_4     2.02  0.95
## cases_5      0.24  0.11
## deaths_5     1.72  0.99
## cases_6      0.10  0.14
## deaths_6     0.61  0.71
## cases_7      0.15  0.33
## deaths_7     0.34  1.06
## cases_8      0.10  0.19
## deaths_8     0.18  0.90
# Check how 2 clusters look
clustVec_km2 <- testCluster_km2$objCluster$cluster

# Create the cluster assessments
plotData_km2 <- assessClusters(clustVec_km2)

## 
## Recency is defined as 2020-07-22 through current
## 
## Recency is defined as 2020-07-22 through current

Given the criteria that deaths matter much more than cases and that aggregate matters more than shape, the main clustering distinction is the 11 states plus DC that had early, heavy disease. While this produces the best mean silhouette width, it appears to be missing the distinction of states with a later spike. The elbow plot is consistent with this, as there is no obvious break where within-sum-squares meaningfully stops decreasing. Suppose that 5 segments are created, with the intent of splitting high/low deaths and early/late spikes:

# Test clusters that weight deaths heavily vs. cases and that weight shape more highly than total
# Using kmeans and testing for 1-10 clusters
testCluster_km5 <- clusterStates(cvFilteredPerCapita, 
                                 minShape=3, 
                                 ratioDeathvsCase = 5, 
                                 ratioTotalvsShape = 0.5, 
                                 minDeath=100, 
                                 minCase=10000, 
                                 hierarchical=FALSE, 
                                 nCenters=5,
                                 testCenters=1:10, 
                                 iter.max=20,
                                 nstart=10, 
                                 returnList=TRUE, 
                                 seed=2008261400
                                 )

## 
## Cluster means and counts
##                 1     2    3     4    5
## .           10.00 14.00 9.00 14.00 4.00
## totalCases   0.28  0.86 0.71  0.50 0.78
## totalDeaths  0.41  1.42 2.73  0.99 5.32
## cases_3      0.03  0.01 0.05  0.02 0.10
## deaths_3     0.27  0.08 0.13  0.06 0.15
## cases_4      0.07  0.06 0.26  0.12 0.50
## deaths_4     1.27  0.63 1.73  1.00 2.58
## cases_5      0.06  0.07 0.24  0.19 0.23
## deaths_5     0.69  0.66 1.77  1.50 1.59
## cases_6      0.08  0.18 0.11  0.15 0.07
## deaths_6     0.42  0.61 0.70  1.01 0.44
## cases_7      0.23  0.44 0.21  0.28 0.06
## deaths_7     0.70  1.60 0.42  0.81 0.18
## cases_8      0.16  0.22 0.14  0.19 0.04
## deaths_8     0.70  1.38 0.25  0.61 0.06
# Check how 5 clusters look
clustVec_km5 <- testCluster_km5$objCluster$cluster

# Create the cluster assessments
plotData_km5 <- assessClusters(clustVec_km5)

## 
## Recency is defined as 2020-07-22 through current
## 
## Recency is defined as 2020-07-22 through current

The clusters appear very similar to those created using hierarchical clustering. A comparison of the segments assigned is made:

tibble::tibble(state=names(clustVec), 
               hier5=clustVec, 
               hier6=clustVec2,
               km2=clustVec_km2,
               km5=clustVec_km5
               ) %>%
    count(hier6, hier5, km2, km5)
## # A tibble: 10 x 5
##    hier6 hier5   km2   km5     n
##    <int> <int> <int> <int> <int>
##  1     1     1     2     1     6
##  2     1     1     2     2     6
##  3     1     1     2     4     1
##  4     2     2     2     2     8
##  5     3     3     2     1     4
##  6     3     3     2     4    13
##  7     4     4     1     5     4
##  8     5     5     1     3     3
##  9     6     5     1     3     5
## 10     6     5     2     3     1

States are sufficiently differentiated, and the method sufficiently focused on deaths and aggregates, such that the clustering techniques produce similar results. There are many states that are near the edges of the clusters, and the choice of metrics and even random seeds will drive their assignments. Provided there are enough segments, there appears to typically be 1) at least one segment of early and heavy disease, 2) at least one segment of late and heavy disease, and 3) at least one segment of much lower than average disease. There is then some differentiation as to how the “early and heavy” and “lower than average” segments are identified and/or further subsetted.

The assessClusters() function is updated to put smaller versions of related plots all on a single page. Example usage is shown below:

# Create the cluster assessments
plotData_km5 <- assessClusters(clustVec_km5, 
                               dfState=stateData, 
                               dfBurden=cvFilteredPerCapita,
                               plotsTogether=TRUE
                               )

## 
## Recency is defined as 2020-07-22 through current
## 
## Recency is defined as 2020-07-22 through current

Hospitalization data is also included in the raw coronavirus file from The COVID Project:

# All fields contained in the raw CV file
names(cvData)
##  [1] "date"                        "state"                      
##  [3] "positive"                    "negative"                   
##  [5] "pending"                     "hospitalizedCurrently"      
##  [7] "hospitalizedCumulative"      "inIcuCurrently"             
##  [9] "inIcuCumulative"             "onVentilatorCurrently"      
## [11] "onVentilatorCumulative"      "recovered"                  
## [13] "dataQualityGrade"            "lastUpdateEt"               
## [15] "dateModified"                "checkTimeEt"                
## [17] "death"                       "hospitalized"               
## [19] "dateChecked"                 "totalTestsViral"            
## [21] "positiveTestsViral"          "negativeTestsViral"         
## [23] "positiveCasesViral"          "deathConfirmed"             
## [25] "deathProbable"               "totalTestEncountersViral"   
## [27] "totalTestsPeopleViral"       "totalTestsAntibody"         
## [29] "positiveTestsAntibody"       "negativeTestsAntibody"      
## [31] "totalTestsPeopleAntibody"    "positiveTestsPeopleAntibody"
## [33] "negativeTestsPeopleAntibody" "totalTestsPeopleAntigen"    
## [35] "positiveTestsPeopleAntigen"  "totalTestsAntigen"          
## [37] "positiveTestsAntigen"        "fips"                       
## [39] "positiveIncrease"            "negativeIncrease"           
## [41] "total"                       "totalTestResults"           
## [43] "totalTestResultsIncrease"    "posNeg"                     
## [45] "deathIncrease"               "hospitalizedIncrease"       
## [47] "hash"                        "commercialScore"            
## [49] "negativeRegularScore"        "negativeScore"              
## [51] "positiveScore"               "score"                      
## [53] "grade"
# Fields matching to 'hosp' or 'icu' or 'ventilator'
hospVars <- names(cvData) %>% 
    grep(x=., pattern="[Hh]osp|[Ii]cu|[Vv]entilator", value=TRUE) %>% 
    sort()
hospVars
## [1] "hospitalized"           "hospitalizedCumulative" "hospitalizedCurrently" 
## [4] "hospitalizedIncrease"   "inIcuCumulative"        "inIcuCurrently"        
## [7] "onVentilatorCumulative" "onVentilatorCurrently"

Data are investigated for amount of ‘missingness’ by time period:

set.seed(2008281323)

cvData %>%
    select_at(vars(all_of(c("state", "date", hospVars)))) %>%
    sample_n(20) %>%
    purrr::set_names(c("state", "date", 
                       "hosp", "hospCum", "hospCur", "hospInc", 
                       "icuCum", "icuCur", "ventCum", "ventCur"
                       )
                     ) %>%
    arrange(date)
## # A tibble: 20 x 10
##    state date        hosp hospCum hospCur hospInc icuCum icuCur ventCum ventCur
##    <chr> <date>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
##  1 WA    2020-03-07    NA      NA      NA       0     NA     NA      NA      NA
##  2 UT    2020-03-13    NA      NA      NA       0     NA     NA      NA      NA
##  3 CO    2020-03-24    72      72     116      14     NA     NA      NA      NA
##  4 CT    2020-04-10    NA      NA    1562       0     NA     NA      NA      NA
##  5 ME    2020-04-13   124     124      22       4     NA     NA      NA      NA
##  6 AK    2020-04-15    34      34      NA       2     NA     NA      NA      NA
##  7 DE    2020-05-05    NA      NA     284       0     NA     NA      NA      NA
##  8 NJ    2020-05-16    NA      NA    3564       0     NA   1061      NA     846
##  9 MS    2020-05-18  1805    1805     559      32     NA    141      NA      75
## 10 CO    2020-05-28  4196    4196     464      36     NA     NA      NA      NA
## 11 NE    2020-06-23  1234    1234     135      22     NA     NA      NA      NA
## 12 WI    2020-06-29  3407    3407     237      14    747     90      NA      NA
## 13 VT    2020-06-29    NA      NA      13       0     NA     NA      NA      NA
## 14 MP    2020-07-03    NA      NA      NA       0     NA     NA      NA      NA
## 15 WA    2020-07-09  4630    4630     329      48     NA     NA      NA      59
## 16 SD    2020-07-31   824     824      31       9     NA     NA      NA      NA
## 17 ID    2020-08-05   906     906     195      20    260     39      NA      NA
## 18 MO    2020-08-07    NA      NA     930       0     NA     NA      NA     113
## 19 MI    2020-08-12    NA      NA     640       0     NA    185      NA      84
## 20 IA    2020-08-16    NA      NA     271       0     NA     80      NA      34

Missing data appears to be common, and not always reflective of zero. There is at least some directional evidence that the hospitalized currently variables has been coming on line and that the hospitalized increase variable uses 0 for both NA and ‘no increase’. This would be problematic for using any of the variables (potentially other than hospCur) for cross-state comparisons. An analysis is run to see the frequency of NA by variable by date:

# Not NA data
notNADate <- cvData %>%
    filter(state %in% c(state.abb, "DC")) %>%
    select_at(vars(all_of(c("state", "date", hospVars)))) %>%
    mutate(nState=1) %>%
    group_by(date) %>%
    summarize_if(is.numeric, .funs=function(x) { sum(!is.na(x))}) %>%
    ungroup()

# Evolution of Not NA states by time
notNADate %>%
    pivot_longer(-date) %>%
    ggplot(aes(x=date, 
               y=value, 
               group=fct_rev(fct_reorder(name, value, .fun=max)), 
               color=fct_rev(fct_reorder(name, value, .fun=max))
               )
           ) + 
    geom_line(lwd=1) + 
    geom_hline(yintercept=51, lty=2) + 
    labs(x="", y="Number of states with non-NA data", title="Evolution of data availability by metric") + 
    scale_color_discrete("") + 
    scale_x_date(date_breaks="1 months", date_labels="%m")

# Confirmation that hospitalized increase and nState are identical
sum(notNADate$hospitalizedIncrease != notNADate$nState)
## [1] 0

The plots confirm the meaningful gaps in the hospitalization, ICU, and ventilator data. Further, hospitalized increase exists and is non-missing for every case where there is a record (nState), suggesting that this metric has already had a filter such as ifelse(is.na(x), 0, x) applied. The only data that appears to grow from NA to potentially stable is ‘hospitalized currently’, which has become valid in all 51 states as of August.

General data availability by metric is:

# Not NA data
notNADateState <- cvData %>%
    filter(state %in% c(state.abb, "DC")) %>%
    select_at(vars(all_of(c("state", "date", hospVars)))) %>%
    mutate(nState=1, month=lubridate::month(date)) %>%
    group_by(month, state) %>%
    summarize_if(is.numeric, .funs=function(x) { min(!is.na(x))}) %>%
    ungroup()

# Evolution of Not NA states by month
notNADateState %>%
    pivot_longer(-c(state, month)) %>%
    filter(!(name %in% c("nState", "hospitalizedIncrease"))) %>%
    ggplot(aes(y=fct_reorder(state, value, .fun=sum), x=month)) + 
    geom_tile(aes(fill=value)) + 
    labs(x="", y="", title="Evolution of data availability by metric") + 
    scale_fill_continuous("", low="white", high="green") + 
    facet_wrap(~name, nrow=1)

# States missing hospitalizedCurrently as of May 1
notNADateState %>%
    filter(hospitalizedCurrently != 1, month >= 5)
## # A tibble: 17 x 11
##    month state hospitalized hospitalizedCum~ hospitalizedCur~ hospitalizedInc~
##    <dbl> <chr>        <int>            <int>            <int>            <int>
##  1     5 DC               0                0                0                1
##  2     5 FL               1                1                0                1
##  3     5 HI               1                1                0                1
##  4     5 KS               1                1                0                1
##  5     5 NE               0                0                0                1
##  6     5 NV               0                0                0                1
##  7     5 OH               1                1                0                1
##  8     5 OK               1                1                0                1
##  9     5 SC               1                1                0                1
## 10     5 UT               1                1                0                1
## 11     6 FL               1                1                0                1
## 12     6 HI               1                1                0                1
## 13     6 KS               1                1                0                1
## 14     6 NE               0                0                0                1
## 15     7 FL               1                1                0                1
## 16     7 HI               1                1                0                1
## 17     7 KS               1                1                0                1
## # ... with 5 more variables: inIcuCumulative <int>, inIcuCurrently <int>,
## #   onVentilatorCumulative <int>, onVentilatorCurrently <int>, nState <int>

The hospitalized currently metric is fully complete as of August, and mostly complete as of June. Only data from Florida, Hawaii, Kansas, and Nebraska is missing, and all but Nebraska report data in ‘hospitalized’ for those time periods. How does the hospitalized data compare with the hospitalizedCurrently data for FL, HI, and KS?

# Hospitalized comparisons
cvData %>%
    arrange(state, date) %>% 
    group_by(state) %>%
    filter(state %in% c("FL", "HI", "KS"), 
           is.na(lag(hospitalizedCurrently, 10)), 
                 !is.na(lead(hospitalizedCurrently, 5))
           ) %>% 
    select(date, state, contains("hosp")) %>% 
    as.data.frame()
##          date state hospitalizedCurrently hospitalizedCumulative hospitalized
## 1  2020-07-05    FL                    NA                  16201        16201
## 2  2020-07-06    FL                    NA                  16352        16352
## 3  2020-07-07    FL                    NA                  16733        16733
## 4  2020-07-08    FL                    NA                  17068        17068
## 5  2020-07-09    FL                    NA                  17479        17479
## 6  2020-07-10    FL                  6974                  17916        17916
## 7  2020-07-11    FL                  7186                  18341        18341
## 8  2020-07-12    FL                  7542                  18590        18590
## 9  2020-07-13    FL                  8051                  18817        18817
## 10 2020-07-14    FL                  8354                  19201        19201
## 11 2020-07-15    FL                  8217                  19659        19659
## 12 2020-07-16    FL                  9112                  20154        20154
## 13 2020-07-17    FL                  8961                  20526        20526
## 14 2020-07-18    FL                  9144                  20969        20969
## 15 2020-07-19    FL                  9363                  21309        21309
## 16 2020-07-09    HI                    NA                    122          122
## 17 2020-07-10    HI                    NA                    123          123
## 18 2020-07-11    HI                    NA                    125          125
## 19 2020-07-12    HI                    NA                    125          125
## 20 2020-07-13    HI                    NA                    125          125
## 21 2020-07-14    HI                    23                    128          128
## 22 2020-07-15    HI                    31                    133          133
## 23 2020-07-16    HI                    40                    137          137
## 24 2020-07-17    HI                    39                    138          138
## 25 2020-07-18    HI                    39                    139          139
## 26 2020-07-19    HI                    39                    140          140
## 27 2020-07-20    HI                    33                    150          150
## 28 2020-07-21    HI                    46                    150          150
## 29 2020-07-22    HI                    47                    151          151
## 30 2020-07-23    HI                    39                    154          154
## 31 2020-07-20    KS                    NA                   1497         1497
## 32 2020-07-21    KS                    NA                   1497         1497
## 33 2020-07-22    KS                    NA                   1545         1545
## 34 2020-07-23    KS                    NA                   1545         1545
## 35 2020-07-24    KS                    NA                   1596         1596
## 36 2020-07-25    KS                   315                   1596         1596
## 37 2020-07-26    KS                   315                   1596         1596
## 38 2020-07-27    KS                   212                   1644         1644
## 39 2020-07-28    KS                   212                   1644         1644
## 40 2020-07-29    KS                   393                   1700         1700
## 41 2020-07-30    KS                   393                   1700         1700
## 42 2020-07-31    KS                   366                   1751         1751
## 43 2020-08-01    KS                   366                   1751         1751
## 44 2020-08-02    KS                   366                   1751         1751
## 45 2020-08-03    KS                   232                   1782         1782
##    hospitalizedIncrease
## 1                   161
## 2                   151
## 3                   381
## 4                   335
## 5                   411
## 6                   437
## 7                   425
## 8                   249
## 9                   227
## 10                  384
## 11                  458
## 12                  495
## 13                  372
## 14                  443
## 15                  340
## 16                    3
## 17                    1
## 18                    2
## 19                    0
## 20                    0
## 21                    3
## 22                    5
## 23                    4
## 24                    1
## 25                    1
## 26                    1
## 27                   10
## 28                    0
## 29                    1
## 30                    3
## 31                   44
## 32                    0
## 33                   48
## 34                    0
## 35                   51
## 36                    0
## 37                    0
## 38                   48
## 39                    0
## 40                   56
## 41                    0
## 42                   51
## 43                    0
## 44                    0
## 45                   31

Prior to reporting hospitalizedCurrently, it appears that the hopitalized field and hospitalizedCumulative fields were identical for these states. And, hospitalizedIncrease appears to be the change in hospitalizedCumulative, which would be the number of people newly admitted to the hospital on that day (no reduction for any discharges/deaths on that day).

The lack of data will meaningfully complicate any cross-state comparisons, since some states did not report the same metrics (or at all) during times when their state had meaningful disease burden as shown by cases and deaths.